Greg Detre
14:30 on Friday, September 20, 2002
Prof. Stuart Shieber, MD G-125
computational linguistics - the study of human language using the tools + techniques of compsci
applications
transcribing, categorising, translating, controlling, dividing, restoring/editing, annotating, changing
speech recognition, e.g. 1-800-COLLECT - do you accept this call? listens just for 'yes'
Descartes, letter to the Marquess of Newcastle (1646) - flexibility of verbal behaviour as the criterion of a 'rational soul'
considers the Blade Runner 'Turing test' distinguishing humans from replicants via iris dilation - a property completely independent of intelligence
the problem of computational linguistics is 'AI-complete' (by analogy with NP-complete)
CL vs NLP
CL - the scientific side
NLP - the engineering side
Shieber's areas of interest concern mainly techniques rather than applications
both engineering + scientific
what's a natural language?
contrast artificial languages, e.g. C++ or Esperanto; borderline cases include modern Hebrew and the Korean alphabet
well-formedness
how useful is grammaticality to NLP applications?
people don't really speak grammatically for the most part
although different dialects have subtly different grammars
people do follow certain rules when they form ill-formed sentences
there are degrees of well-formedness
epsilon well-formedness - a distance measure of how well-formed a string is
would an NLP system benefit from being able to distinguish whether or not something is well-formed?
maybe use a probabilistic method in assessing well-formedness
but this still assumes a binary distinction of well- vs ill-formed
can we actually distinguish well-formedness from meaningfulness?
maybe well-formedness is secondary to meaningfulness (e.g. Schank)
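The probabilistic idea above can be sketched as a toy add-alpha bigram scorer: instead of a binary grammatical/ungrammatical verdict, a sentence gets a degree of well-formedness. The corpus, vocabulary size and smoothing constant are all made up for illustration:

```python
# Toy probabilistic well-formedness: smoothed bigram log-probability.
# Higher score = closer to well-formed. Corpus and constants are illustrative.
import math
from collections import Counter

CORPUS = [
    "the cat likes tuna",
    "the dog chased the cat",
    "the rat bit the dog",
]

unigrams = Counter()
bigrams = Counter()
for sent in CORPUS:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])              # history counts
    bigrams.update(zip(words, words[1:]))    # adjacent-pair counts

def score(sentence, alpha=0.1, vocab_size=20):
    """Add-alpha-smoothed bigram log-probability of a sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(words, words[1:]):
        p = (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab_size)
        logp += math.log(p)
    return logp

print(score("the cat likes tuna"))   # relatively high: all bigrams seen
print(score("tuna the likes cat"))   # much lower: unseen bigrams
```

The score gives exactly the graded, non-binary notion the notes ask about, though the "degrees of well-formedness" it assigns are only as good as the corpus.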
Hockett's theory (in 'A Manual of Phonology', 1955)
he wants to give an algorithmic model of human speech
pg 4, section 0211-0213
proposes the possibility of a talking dog - 'let me add at once that I am entirely serious'
how good is his theory?
it characterises only well-formed English strings over that vocabulary, but not all of them
you could extend it to cover all the well-formed strings only by making it huge (because it doesn't really recurse)
What kinds of languages are describable in general by Hockett's method?
allows for infinite loops
is it a finite state machine or a context-free grammar?
if it could generate a^n b^n, that would show it's more than regular - but what's 'regular'?
all you know is what state you're in, and there's only a finite number of those - you don't have a stack, and you don't have a memory or a count
this is a finite state model, and this is the transition matrix
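A minimal sketch of what such a finite-state model amounts to: generation driven by a transition matrix, where the current state is the machine's only memory. The state names and toy vocabulary here are made up for illustration, not Hockett's own:

```python
# A toy finite-state sentence model: each state maps to the words it may
# emit next and the state each word leads to. No stack, no counter --
# the current state is the machine's entire memory.

TRANSITIONS = {
    "START":     [("the", "DET_SEEN")],
    "DET_SEEN":  [("cat", "NOUN_SEEN"), ("dog", "NOUN_SEEN")],
    "NOUN_SEEN": [("likes", "VERB_SEEN"), ("chased", "VERB_SEEN")],
    "VERB_SEEN": [("tuna", "END")],
}

def accepts(words):
    """Return True iff the word sequence is generated by the machine."""
    state = "START"
    for w in words:
        for word, nxt in TRANSITIONS.get(state, []):
            if word == w:
                state = nxt
                break
        else:
            return False          # no transition matches this word
    return state == "END"

print(accepts("the cat likes tuna".split()))          # True
print(accepts("the cat the dog likes tuna".split()))  # False: cannot nest
```

The failure on the nested sentence is the point of the lecture: without a stack or counter, center-embedding is out of reach.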
Is English (viewed as a set of strings) a regular language?
no?
Chomsky invented the Chomsky hierarchy (types 0-3) to try to prove that English (and every natural language) isn't a regular language
ww^R (a string followed by its reversal, i.e. even-length palindromes)
balanced parentheses
a^n b^n (e.g. anti-anti-missile-missile missile - but beyond about 2 levels of nesting it ceases to be meaningful English)
relative clauses
the cat likes tuna fish
the cat the dog chased likes tuna fish
the cat the dog the rat bit chased likes tuna fish
maybe the English language itself is anchored to the cognitive limitations of its speakers
if you want to limit English sentences to less than 1 million words, then English contains a finite number of strings, and is therefore regular
but consider C++ …???
what about the practical limit of the length of our lives?
you just have to make idealisations (just as in physics)
but the difference here may be that, unlike with the classical physical laws of the universe, it's not at all obvious that there is a determinate answer as to whether a given sentence is grammatical in English
how do you prove that English is/not regular?
to prove that English is not regular is hard, because there can be non-regular subsets of a regular language
apply the pumping lemma directly???
use closure properties:
apply the homomorphism that turns all nouns to a and all verbs to b (in the cat/dog example above)
the argument is that you end up with a^n b^n, which is not regular
or intersect English with a regular language - remember that regular languages are closed under intersection with other regular languages
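The homomorphism step can be sketched on the cat/dog sentences above: build the centre-embedded sentence with n nouns, map nouns to a and verbs to b (erasing everything else), and the image is a^n b^n. The word lists are illustrative:

```python
# Sketch of the homomorphism argument: centre-embedded relative clauses
# map (nouns -> a, verbs -> b, everything else erased) onto the
# non-regular language a^n b^n. Vocabulary is illustrative.

NOUNS = ["cat", "dog", "rat", "mouse"]
VERBS = ["bit", "chased", "scared"]

def embedded(n):
    """'the cat the dog ... bit chased likes tuna fish' with n nouns."""
    words = []
    for noun in NOUNS[:n]:
        words += ["the", noun]
    words += VERBS[:n - 1]            # one transitive verb per embedded clause
    words += ["likes", "tuna", "fish"]
    return words

def image(words):
    """Apply the homomorphism, yielding a string of a's and b's."""
    out = ""
    for w in words:
        if w in NOUNS:
            out += "a"
        elif w in VERBS or w == "likes":
            out += "b"
    return out

print(" ".join(embedded(3)))  # the cat the dog the rat bit chased likes tuna fish
print(image(embedded(3)))     # aaabbb
```

Since homomorphic images of regular languages are regular, and a^n b^n isn't, the fragment of English containing these sentences can't be regular (modulo the cognitive-limit caveat discussed above).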
The Chomsky hierarchy
type 0: recursively enumerable
type 1: context-sensitive
type 2: context-free
type 3: regular
invented this to show that Hockett's theory is inadequate
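The step from type 3 to type 2 can be made concrete: recognising a^n b^n takes an unbounded counter (or a stack), which is exactly the resource a finite-state machine lacks. A minimal sketch:

```python
def is_anbn(s):
    """Recognise a^n b^n (n >= 1) using one unbounded counter --
    one step beyond finite-state power, well within context-free power."""
    i, count = 0, 0
    while i < len(s) and s[i] == "a":   # count the a's
        count += 1
        i += 1
    while i < len(s) and s[i] == "b":   # cancel them against the b's
        count -= 1
        i += 1
    # accept iff the input is exhausted, counts match, and s is non-empty
    return i == len(s) and count == 0 and len(s) > 0

print(is_anbn("aaabbb"))  # True
print(is_anbn("aabbb"))   # False
```

A context-free grammar does the same job with the two rules S -> a S b and S -> a b; no finite transition matrix can, because `count` has no bound.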
Are natural languages context-free?
Chomsky - context-free grammars are inappropriate for modelling natural language
1981 - Gazdar + Pullum - claimed that every previous proof that natural language is not context-free was wrong (for empirical or theoretical reasons)
1986 - the issue was closed, because Gazdar + Pullum …???
to show non-regularity, we showed dependency
to show non-context-freeness what pattern of dependency would you have to show?
what did Toad bake ___?
who met John ___? whom did John meet ___?
Norwegian may allow multiple questions
if you have finitely-many dependencies, you can encode them - but if you have arbitrarily many …
if you have cross-serial dependencies (i.e. that aren't nested within each other, but cross over each other) …
e.g. John, Mary and Bill bought a Ford, a Porsche and a Pontiac respectively
or better: John, Mary and Bill killed himself, herself and himself respectively
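The 'respectively' examples can be generated mechanically; the point is that the i-th reflexive must agree with the i-th subject, so the dependencies cross rather than nest, like the copy language ww. The names and the gender table below are just the source's example:

```python
# Sketch of a cross-serial dependency: in the "respectively" construction
# the i-th reflexive agrees with the i-th subject, so dependencies cross
# over each other instead of nesting. Names/genders are illustrative.

GENDER = {"John": "himself", "Mary": "herself", "Bill": "himself"}

def respectively(names):
    """Build 'A, B and C killed himself, herself and himself respectively'.
    Requires at least two names, each with an entry in GENDER."""
    subjects = ", ".join(names[:-1]) + " and " + names[-1]
    reflexives = [GENDER[n] for n in names]   # i-th reflexive <- i-th subject
    objects = ", ".join(reflexives[:-1]) + " and " + reflexives[-1]
    return f"{subjects} killed {objects} respectively"

print(respectively(["John", "Mary", "Bill"]))
```

A nested (context-free-style) matching would pair the first subject with the last reflexive; here the pairing runs in parallel, which is the pattern a non-context-freeness argument needs.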
if one particular natural language is shown not to be CF, then (if you assume that there is something universal about natural language) chances are that you�ve shown something about natural language in general
limited to 33 � needs to be 20
only 13 are deadly serious
due in by Mon morning, returned by Mon eve
first half class/lecture (lots of discussion), second half project
Star Trek - 'The Ultimate Computer' - the computer has control of the Enterprise, but Kirk convinces the computer to relinquish control
is Esperanto as messy as other 'real' natural languages???
apparently yes, like Hebrew
the verb is recur (means 'happen again', as in recursive) rather than recurse ('curse again')